Analizando el Corpus

En primer lugar vamos a obtener algunas estadísticas sobre el corpus que nos permitan orientar el pre-procesamiento posterior. En concreto, nuestro interés debe centrarse en encontrar:

  • Títulos con un número elevado de reviews que puedan provocar bias
  • Terminología común a filtrar que no añade valor descriptivo semántico (ej: movie, film, good, bad...)
  • Pistas que nos indiquen que modelo de representación podría ser el más adecuado

Formato del Corpus

El corpus de Imdb consta de un total de 100K de reviews de películas extraídos de la base de datos de IMDB. Este dataset fue inicialmente diseñado para la investigación en Sentiment Analysis. Por este motivo, 50K de los 100K opiniones están etiquetas con la polaridad (en este caso en forma de nota a la película por el usuario), 25K de ellos corresponden a reviews con opiniones positivas y otros 25K a reviews con opiniones negativas. Igualmente, el dataset está balanceado en cuanto a muestras de entrenamiento y test, 25K para training y 25K para test. Los 50k restante no están etiquetados y están pensados para pruebas de unsupervising learning como es nuestro caso.

En primer lugar vamos a ver cual es la distribución de reviews por películas. Nuestro objetivo es ver si el dataset está balanceado en este sentido ya que aquellas películas con gran número de opiniones podrían crear un sesgo y tendríamos que buscar alguna estrategia para balancearlo nosotros mismos.

Cada review viene en un fichero .txt individual cuyo nombre tiene el siguiente formato: [id]_[rating].txt. El id es único por review. Para cada grupo de reviews existe un fichero [urls_[pos, neg, unsup].txt en el que cada línea contiene el identificador de la película con ID igual al número de línea. Es decir, la línea 0 de este fichero contiene el identificador (URL de IMDB) de la película cuya review pertenece al fichero 0_[rating].txt

In [3]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import os
from collections import Counter
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pprint
from nltk import FreqDist
import gensim
from sklearn.feature_extraction.text import HashingVectorizer
    
vectorizer = HashingVectorizer(stop_words='english', strip_accents='unicode')
analyzer = vectorizer.build_analyzer()

id_pattern = re.compile('title\/(.*)\/')
pp = pprint.PrettyPrinter(width=100, compact=True)

def walk_corpus(path, pattern):
    import fnmatch
    for root, dirnames, filenames in os.walk(path):
        for filename in fnmatch.filter(filenames, pattern):
            yield os.path.join(root, filename)
                
def corpus_stats(path, pattern):
    cnt = Counter()
    for urls_file in walk_corpus(path, pattern):
        with open(urls_file) as f:
            lines = f.readlines()
            for line in lines:
                movie_id = id_pattern.search(line).group(1)
                cnt[movie_id] += 1
            
    distinct = len(cnt)
    size = sum(cnt.values())
    print('Number of different movies reviewed: {}'.format(distinct))
    print('Number of total reviews: {}'.format(size))
    print('Reviews per movie Average: {}'.format(size/distinct))

    df = pd.DataFrame.from_dict(cnt, orient='index').reset_index()
    ax = df.hist(grid=False)
    ax[0][0].set_xlabel("Number of Reviews", labelpad=20, weight='bold', size=12)
    ax[0][0].set_ylabel("Number of Movies", labelpad=20, weight='bold', size=12)

Distribución de Reviews en el corpus completo

In [2]:
corpus_stats('./resources/aclImdb/', 'urls_*.txt')
Number of different movies reviewed: 14127
Number of total reviews: 100000
Reviews per movie Average: 7.078643731860976

Distribucion de Reviews en datos no etiquetados

In [3]:
corpus_stats('./resources/aclImdb/train/', 'urls_unsup*')
Number of different movies reviewed: 7091
Number of total reviews: 50000
Reviews per movie Average: 7.051191651389085

Distribucion de Reviews en datos de test

In [4]:
corpus_stats('./resources/aclImdb/test/', 'urls_*.txt')
Number of different movies reviewed: 3581
Number of total reviews: 25000
Reviews per movie Average: 6.9812901424183185

Estadísticas de Texto

En este apartado vamos a analizar con ayuda de NLTK como es la distribución del texto a lo largo de todo el corpus. Estaremos interesados en caracteríticas como tokens más frecuentes, longitud del corpus, longitud del vocabulario, etcétera.

In [55]:
def tokenize_corpus(path, pattern, mode='d'):
    for corpus_file in walk_corpus(path, pattern):
        with open(corpus_file, 'r') as next_file:
            next_review = next_file.read()
            tokens = analyzer(next_review)
            if mode == 'd':
                yield tokens
            else:
                for token in tokens:
                    yield token

Distribución de Frecuencias

La mayoría de herramientas que trabajan con texto en python (NLTK, gensim, Scikit Learn...) necesitan manejar una estructura de datos en la que se implementa una distribución de frecuencias que da lugar a una representación conocida como Bag of Words (BoW) en la que simplemente, por cada documento o a nivel global del corpus, se mantiene un contador con el número de apariciones de cada palabra o token

In [45]:
%time dist = FreqDist(tokenize_corpus('./resources/aclImdb/all/', '*.txt', mode='t'))
CPU times: user 27 s, sys: 15.1 s, total: 42.1 s
Wall time: 3min 32s
In [16]:
print(dist)
df = pd.DataFrame(dist.most_common(100))
df.columns = ['Token', 'Frecuencia']
df.head(20)
<FreqDist with 140536 samples and 11179231 outcomes>
Out[16]:
Token Frecuencia
0 br 406911
1 movie 176171
2 film 160801
3 like 80889
4 just 71006
5 good 59495
6 time 50228
7 story 46706
8 really 46282
9 bad 37542
10 people 36786
11 great 36536
12 don 35188
13 make 32393
14 way 31710
15 movies 30776
16 characters 29143
17 think 28336
18 character 28265
19 films 28034
In [18]:
df = pd.DataFrame(dist.most_common()[-100:])
df.columns = ['Token', 'Frecuencia']
df.head(20)
Out[18]:
Token Frecuencia
0 caisse 1
1 peli 1
2 3462 1
3 strutts 1
4 exhude 1
5 finnlayson 1
6 mmb 1
7 marthesheimer 1
8 trattoria 1
9 mjyoung 1
10 fluidic 1
11 liberator 1
12 korzeniowsky 1
13 oculist 1
14 burrowes 1
15 kornhauser 1
16 gerri 1
17 theirry 1
18 tatta 1
19 commitophobe 1

Diseñando Nuestro Modelo

Diccionario

Nuestro modelo de tópicos estará basado en una representación BoW del corpus. Únicamente tendremos en cuenta la frecuencia global de los términos y no una frecuencia de documentos tipo TF-IDF. Lo primero que necesitamos construir es un diccionario con nuestro vocabulario. Empezaremos con un vocabulario sin filtros, para comprobar que resultamos obtenemos y si nuestro estudio previo ha tenido sentido a la hora de ayudarnos con el filtrado posterior.

Empezamos a utilizar gensim para construir el diccionario.

In [136]:
stream = tokenize_corpus('./resources/aclImdb/all/', '*.txt')
%time dictionary = gensim.corpora.Dictionary(stream)
dictionary.save('original.dict')
CPU times: user 43.6 s, sys: 19 s, total: 1min 2s
Wall time: 4min 54s
In [10]:
data = [[dictionary.num_docs, dictionary.num_pos, len(dictionary.token2id)]]
df = pd.DataFrame(data)
df.columns=['Numero de reviews analizadas', 'Numero de tokens analizados', 'Numero de tokens únicos actuales']
df.head()
Out[10]:
Numero de reviews analizadas Numero de tokens analizados Numero de tokens únicos actuales
0 100000 11179231 140536

Corpus

Necesitamos declarar un iterable para acceder en streaming a la representación BoW de cada uno de nuestros documentos (reviews). Este iterable será utilizado de forma eficiente por gensim para entrenar el modelo de forma iterativa en un número determinado de pasadas.

In [14]:
class MovieCorpus(object):

    def __init__(self, path, dictionary):
        self.__path = path
        self.__dictionary = dictionary

    def __iter__(self):
        for tokens in tokenize_corpus(self.__path, '*.txt'):
            yield self.__dictionary.doc2bow(tokens)

    def __len__(self):
        return len(self.__dictionary)
In [16]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    accept a ldamodel, atopic number and topn vocabs of interest
    prints a formatted list of the topn terms
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))
    
    return terms

def print_lda_model(lda_model, num_topics=20):
    topic_summaries = []
    print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')
    for i in range(num_topics):
        print('\n')
        print('Topic '+str(i)+' |---------------------\n')
        tmp = explore_topic(lda_model,topic_number=i, topn=10, output=True )
        topic_summaries += [tmp[:5]]
        print
In [71]:
dictionary = gensim.corpora.Dictionary.load('original.dict')
corpus = MovieCorpus("./resources/aclImdb/all", dictionary)
%time lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary)
CPU times: user 2min 42s, sys: 45.2 s, total: 3min 27s
Wall time: 10min 10s
In [55]:
print_lda_model(lda_model)
term                 frequency



Topic 0 |---------------------

br                   0.010
john                 0.006
joe                  0.006
town                 0.005
tony                 0.005
western              0.005
old                  0.004
time                 0.004
man                  0.004
harry                0.004


Topic 1 |---------------------

new                  0.007
school               0.006
br                   0.006
york                 0.005
film                 0.005
high                 0.004
allen                0.004
welles               0.004
best                 0.003
young                0.003


Topic 2 |---------------------

life                 0.016
story                0.009
family               0.009
man                  0.008
young                0.008
love                 0.007
father               0.007
film                 0.007
people               0.006
world                0.006


Topic 3 |---------------------

film                 0.018
role                 0.012
performance          0.011
great                0.010
cast                 0.010
best                 0.008
actor                0.006
plays                0.006
good                 0.006
character            0.006


Topic 4 |---------------------

br                   0.042
film                 0.008
plot                 0.005
character            0.004
murder               0.004
police               0.004
does                 0.003
just                 0.003
time                 0.003
end                  0.003


Topic 5 |---------------------

film                 0.075
films                0.014
story                0.009
characters           0.007
like                 0.006
time                 0.006
director             0.006
seen                 0.005
great                0.005
work                 0.005


Topic 6 |---------------------

tom                  0.008
matthau              0.007
ben                  0.006
cagney               0.006
new                  0.006
jack                 0.006
stanwyck             0.006
ned                  0.006
julie                0.006
daniel               0.005


Topic 7 |---------------------

movie                0.073
bad                  0.018
just                 0.017
like                 0.017
good                 0.015
movies               0.015
watch                0.011
time                 0.011
don                  0.010
acting               0.010


Topic 8 |---------------------

batman               0.012
marie                0.011
dr                   0.011
series               0.007
christopher          0.007
doctor               0.006
karloff              0.006
werewolf             0.006
lee                  0.006
dracula              0.005


Topic 9 |---------------------

kids                 0.013
original             0.011
years                0.011
old                  0.011
animation            0.009
children             0.009
dvd                  0.009
disney               0.007
great                0.007
movie                0.007


Topic 10 |---------------------

series               0.016
episode              0.013
br                   0.009
fi                   0.008
sci                  0.008
earth                0.008
space                0.007
episodes             0.006
science              0.006
season               0.006


Topic 11 |---------------------

horror               0.027
film                 0.015
br                   0.013
blood                0.007
films                0.007
gore                 0.007
good                 0.006
like                 0.006
pretty               0.005
house                0.005


Topic 12 |---------------------

br                   0.057
movie                0.038
like                 0.015
just                 0.014
really               0.012
good                 0.012
great                0.009
think                0.008
don                  0.007
love                 0.006


Topic 13 |---------------------

br                   0.282
movie                0.006
people               0.005
story                0.005
time                 0.005
world                0.004
american             0.003
history              0.003
like                 0.003
film                 0.002


Topic 14 |---------------------

book                 0.009
version              0.007
novel                0.007
musical              0.006
jane                 0.005
br                   0.005
play                 0.005
comedy               0.005
mr                   0.005
great                0.004


Topic 15 |---------------------

action               0.030
war                  0.022
fight                0.011
movie                0.010
good                 0.007
battle               0.006
fighting             0.006
scenes               0.006
like                 0.005
movies               0.005


Topic 16 |---------------------

series               0.022
match                0.017
anime                0.010
episodes             0.010
morgan               0.008
br                   0.008
vs                   0.006
episode              0.006
freeman              0.006
miike                0.006


Topic 17 |---------------------

game                 0.013
like                 0.010
scene                0.007
guy                  0.006
house                0.006
dead                 0.005
man                  0.005
gets                 0.005
looks                0.004
look                 0.004


Topic 18 |---------------------

king                 0.019
rock                 0.018
band                 0.015
dennis               0.009
harris               0.007
ed                   0.007
metal                0.007
lugosi               0.007
astaire              0.006
music                0.005


Topic 19 |---------------------

french               0.006
paris                0.006
la                   0.003
crawford             0.003
andre                0.003
garbo                0.003
le                   0.002
love                 0.002
kurosawa             0.002
jean                 0.002
In [85]:
gensim.corpora.MmCorpus.serialize('corpus.mm', corpus)
In [86]:
corpus = gensim.corpora.MmCorpus('corpus.mm')
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis
/opt/conda/lib/python3.6/site-packages/pyLDAvis/_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))
Out[86]:

Evaluando Modelo LDA

Necesitamos poder comparar los distintos modelos que vamos a ir generando para poder comprobar que van mejorando con las acciones que tomamos. Existen muchas diversas formas de evaluar un modelo LDA, cualquiera compatible con evaluar clusters procedentes de algoritmos de clustering.

Los clusters se suelen evaluar midiendo la coherencia de sus componentes. En nuestro caso concreto, cada tópico tendrá mayor calidad si:

  • Los documentos dominados por los mismos tópicos han de ser similares entre si
  • Los documentos dominados por tópicos diferentes y poco solapados han de ser distintos entre si

Afortunadamente gensim proporciona sus propias herramientas para medir la coherencia que usamos a continuación.

In [72]:
from gensim.models.coherencemodel import CoherenceModel
cm = CoherenceModel(model=lda_model, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
Out[72]:
[-2.9277347181684994,
 -1.6110987847944827,
 -1.7473205556728812,
 -1.4248872582976961,
 -4.3538053341423373,
 -3.3073109403726977,
 -3.5863808476146426,
 -2.0216863193684409,
 -1.2965622240480907,
 -2.0969289894803977,
 -2.6078914644280973,
 -3.089402851685418,
 -3.0390617215163767,
 -3.4810382818397834,
 -3.1906246894280708,
 -3.3149978646861418,
 -3.7799615695211823,
 -2.9833106936601888,
 -5.2358393181978586,
 -1.4855322610999757]

Filtrando Tokens Frecuentes

In [56]:
mc100 = [mc[0] for mc in dist.most_common(100)]
terms_id = lda_model.get_topic_terms(3)
terms_str = [dictionary.id2token[id[0]] for id in terms_id if id[0] in dictionary.id2token]
list(set(mc100) & set(terms_str))
Out[56]:
['role', 'film', 'good', 'best', 'great', 'cast', 'performance', 'character']
In [56]:
def dictionary_filter_most_frequent(dictionary, dist, n=200):
    most_common = dist.most_common(n)
    mc_ids = [dictionary.token2id[t[0]] for t in most_common]
    dictionary.filter_tokens(bad_ids=mc_ids)
    dictionary.compactify()
In [90]:
print('Longitud del vocabulario actual: {}'.format(len(dictionary.token2id)))
dictionary_filter_most_frequent(dictionary, dist)
# Filter out words that occur less than 10 documents, or more than 50% of the documents.
dictionary.filter_extremes(no_below=10, no_above=0.5)
print('Longitud del vocabulario Filtrado: {}'.format(len(dictionary.token2id)))
Longitud del vocabulario actual: 140536
Longitud del vocabulario Filtrado: 35598

Normalizando Tokens

Además del lowercase, vamos a eliminar también los plurales

In [57]:
def normalize_dictionary(dictionary):
    from textblob import Word
    plurals = []
    for token in dictionary.values():
        if token.endswith('s'):
            singular = Word(token).singularize()
            if token != singular:
                singular_id = dictionary.token2id.get(singular, None)
                if singular_id:
                    plurals.append(dictionary.token2id[token])
                    
    dictionary.filter_tokens(bad_ids=plurals)
    dictionary.compactify()
    return plurals
In [92]:
print('Numero de tokens únicos actuales: {}'.format(len(dictionary.token2id)))
plurals = normalize_dictionary(dictionary)
print('Numero de plurales detectados: {}'.format(len(plurals)))
print('Numero de tokens únicos actuales: {}'.format(len(dictionary.token2id)))
Numero de tokens únicos actuales: 35598
Numero de plurales detectados: 5173
Numero de tokens únicos actuales: 30425
In [93]:
dictionary.save('normalized.v1.dict')
In [94]:
dictionary = gensim.corpora.Dictionary.load('normalized.v1.dict')
corpus2 = MovieCorpus("./resources/aclImdb/all", dictionary)
gensim.corpora.MmCorpus.serialize('corpus2.mm', corpus2)
corpus2 = gensim.corpora.MmCorpus('corpus2.mm')
%time lda_model = gensim.models.ldamodel.LdaModel(corpus2, num_topics=20, id2word=dictionary)
print_lda_model(lda_model)
CPU times: user 1min 3s, sys: 0 ns, total: 1min 3s
Wall time: 1min 2s
term                 frequency



Topic 0 |---------------------

art                  0.006
style                0.005
amazing              0.005
dialogue             0.004
brilliant            0.004
cinematography       0.004
cinema               0.004
drama                0.004
direction            0.004
written              0.003


Topic 1 |---------------------

episode              0.023
jack                 0.012
captain              0.006
anime                0.005
ship                 0.005
sam                  0.005
army                 0.004
crew                 0.004
air                  0.004
sky                  0.003


Topic 2 |---------------------

documentary          0.009
history              0.007
french               0.006
cinema               0.005
country              0.005
political            0.005
british              0.003
america              0.003
power                0.003
social               0.003


Topic 3 |---------------------

police               0.007
car                  0.006
town                 0.005
room                 0.005
allen                0.005
starts               0.004
killer               0.004
girls                0.004
finds                0.004
getting              0.004


Topic 4 |---------------------

prison               0.008
japanese             0.008
white                0.004
kill                 0.004
hero                 0.004
group                0.004
gun                  0.004
president            0.004
attack               0.004
drug                 0.003


Topic 5 |---------------------

fight                0.010
chinese              0.009
martial              0.008
fu                   0.006
morgan               0.006
kung                 0.006
jackie               0.006
chan                 0.006
luke                 0.005
china                0.004


Topic 6 |---------------------

heart                0.006
mother               0.005
relationship         0.004
wonderful            0.004
perfect              0.004
strong               0.003
lives                0.003
emotional            0.003
journey              0.003
moving               0.003


Topic 7 |---------------------

tom                  0.006
wonderful            0.005
oscar                0.005
romantic             0.005
stewart              0.004
perfect              0.004
ben                  0.003
danny                0.003
ford                 0.003
career               0.003


Topic 8 |---------------------

fi                   0.014
sci                  0.014
space                0.013
season               0.012
earth                0.012
science              0.010
planet               0.009
future               0.008
fiction              0.008
wars                 0.007


Topic 9 |---------------------

lee                  0.012
doctor               0.010
dr                   0.008
peter                0.008
jones                0.004
hospital             0.004
master               0.004
lucy                 0.004
keaton               0.004
ritter               0.004


Topic 10 |---------------------

game                 0.012
blood                0.008
gore                 0.008
fans                 0.006
genre                0.006
killer               0.005
zombie               0.005
flick                0.005
evil                 0.005
creepy               0.005


Topic 11 |---------------------

husband              0.009
harry                0.007
mary                 0.006
daughter             0.005
robert               0.005
william              0.005
tony                 0.004
arthur               0.004
joan                 0.004
mother               0.004


Topic 12 |---------------------

animation            0.012
disney               0.011
animated             0.009
children             0.005
robin                0.005
adventure            0.005
powell               0.005
king                 0.004
voice                0.004
douglas              0.004


Topic 13 |---------------------

jane                 0.009
murder               0.009
noir                 0.007
michael              0.006
thriller             0.005
novel                0.005
crime                0.005
paul                 0.005
sexual               0.005
hitchcock            0.004


Topic 14 |---------------------

laugh                0.005
mean                 0.005
oh                   0.005
couldn               0.005
video                0.005
terrible             0.005
guys                 0.004
awful                0.004
horrible             0.004
waste                0.004


Topic 15 |---------------------

bond                 0.011
christmas            0.008
indian               0.008
ghost                0.005
hoffman              0.005
emma                 0.005
james                0.004
david                0.004
mr                   0.004
australian           0.004


Topic 16 |---------------------

king                 0.013
team                 0.005
baseball             0.005
lugosi               0.005
george               0.005
football             0.004
jr                   0.004
lion                 0.004
ed                   0.004
karloff              0.004


Topic 17 |---------------------

humor                0.007
loved                0.007
charlie              0.006
favorite             0.006
cartoon              0.005
smith                0.005
today                0.005
remember             0.005
hilarious            0.005
adam                 0.005


Topic 18 |---------------------

human                0.007
children             0.005
lives                0.004
person               0.004
message              0.004
child                0.004
understand           0.003
god                  0.003
viewer               0.003
self                 0.003


Topic 19 |---------------------

musical              0.018
dance                0.016
kelly                0.013
song                 0.011
dancing              0.011
singing              0.009
welles               0.008
sinatra              0.007
stage                0.007
mgm                  0.006
In [95]:
vis = pyLDAvis.gensim.prepare(lda_model, corpus2, dictionary)
vis
/opt/conda/lib/python3.6/site-packages/pyLDAvis/_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))
Out[95]:
In [96]:
cm = CoherenceModel(model=lda_model, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
Out[96]:
[-2.7249953664705537,
 -4.8343158848080261,
 -2.9922984556968912,
 -3.2240404173821271,
 -3.309032242410932,
 -7.5847822349995209,
 -2.773059637601031,
 -3.0913801380667345,
 -3.105118440353126,
 -6.4102889970390464,
 -2.804127426649357,
 -3.2114181728942688,
 -4.9062036037519539,
 -4.1926180177013377,
 -2.6366711760524701,
 -5.5994307725766239,
 -5.1554215835064099,
 -3.1388181824501715,
 -2.9103682298132827,
 -3.351591545253445]

Limitando el Vocabulario por Frecuencia

In [25]:
def dictionary_keep_n_frequent(dictionary, dist, n=5000):
    tokens_by_freq = dist.most_common(len(dist))
    mf = []
    for t in tokens_by_freq:
        id = dictionary.token2id.get(t[0], None)
        if id:
            mf.append(id)
            if len(mf) == n:
                break
    dictionary.filter_tokens(good_ids=mf)

10 Topics

In [100]:
dictionary = gensim.corpora.Dictionary.load('normalized.v1.dict')
dictionary_keep_n_frequent(dictionary, dist)
corpus = MovieCorpus("./resources/aclImdb/all", dictionary)
gensim.corpora.MmCorpus.serialize('corpus3.mm', corpus)
corpus = gensim.corpora.MmCorpus('corpus3.mm')
%time lda_model= gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word=dictionary)
print_lda_model(lda_model, 10)
CPU times: user 58.9 s, sys: 0 ns, total: 58.9 s
Wall time: 59.1 s
term                 frequency



Topic 0 |---------------------

episode              0.013
car                  0.008
season               0.007
guys                 0.006
kill                 0.004
killed               0.004
head                 0.004
police               0.004
gun                  0.004
says                 0.004


Topic 1 |---------------------

laugh                0.006
terrible             0.005
video                0.005
awful                0.005
mean                 0.005
waste                0.004
couldn               0.004
absolutely           0.004
lines                0.004
horrible             0.004


Topic 2 |---------------------

japanese             0.006
bruce                0.005
style                0.005
feature              0.004
silent               0.004
footage              0.004
welles               0.004
comic                0.004
al                   0.003
christmas            0.003


Topic 3 |---------------------

game                 0.009
joe                  0.006
george               0.006
tom                  0.006
jack                 0.005
town                 0.005
scott                0.005
mary                 0.005
lee                  0.004
harry                0.004


Topic 4 |---------------------

song                 0.011
musical              0.008
rock                 0.008
animation            0.008
voice                0.007
dance                0.006
disney               0.006
king                 0.006
match                0.006
animated             0.006


Topic 5 |---------------------

fi                   0.010
sci                  0.010
earth                0.009
space                0.008
monster              0.008
science              0.008
dr                   0.008
human                0.007
planet               0.006
island               0.006


Topic 6 |---------------------

mother               0.008
children             0.007
child                0.006
lives                0.006
boy                  0.006
heart                0.005
loved                0.005
son                  0.005
relationship         0.004
live                 0.004


Topic 7 |---------------------

killer               0.011
murder               0.007
thriller             0.007
blood                0.007
gore                 0.006
genre                0.006
violence             0.006
dark                 0.005
creepy               0.005
scary                0.005


Topic 8 |---------------------

wonderful            0.007
paul                 0.007
james                0.007
mr                   0.006
robert               0.006
roles                0.005
supporting           0.005
kelly                0.005
oscar                0.005
stewart              0.005


Topic 9 |---------------------

history              0.004
cinema               0.003
documentary          0.003
french               0.003
novel                0.003
art                  0.003
based                0.003
viewer               0.003
drama                0.003
british              0.003

20 Topics

In [101]:
%time lda_model3 = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary)
print_lda_model(lda_model3, 20)
CPU times: user 56.1 s, sys: 0 ns, total: 56.1 s
Wall time: 56.2 s
term                 frequency



Topic 0 |---------------------

episode              0.028
season               0.017
television           0.011
oscar                0.011
kelly                0.010
award                0.008
stars                0.008
ben                  0.007
welles               0.007
al                   0.007


Topic 1 |---------------------

town                 0.016
james                0.011
western              0.010
scott                0.010
bond                 0.008
white                0.008
george               0.008
richard              0.007
william              0.007
small                0.007


Topic 2 |---------------------

guys                 0.013
car                  0.010
rock                 0.009
cool                 0.008
match                0.007
oh                   0.006
fight                0.006
band                 0.006
stuff                0.006
hot                  0.005


Topic 3 |---------------------

novel                0.017
wonderful            0.016
jane                 0.014
jack                 0.013
french               0.010
tony                 0.010
perfect              0.009
adaptation           0.008
henry                0.008
paris                0.008


Topic 4 |---------------------

country              0.010
america              0.008
documentary          0.008
city                 0.008
german               0.007
history              0.006
south                0.006
americans            0.005
french               0.005
political            0.005


Topic 5 |---------------------

game                 0.029
video                0.020
animation            0.018
disney               0.014
animated             0.011
christmas            0.010
voice                0.009
cartoon              0.008
copy                 0.007
release              0.007


Topic 6 |---------------------

history              0.006
reality              0.006
society              0.005
social               0.005
stories              0.005
view                 0.004
portrayed            0.004
historical           0.003
self                 0.003
important            0.003


Topic 7 |---------------------

harry                0.009
douglas              0.005
barbara              0.005
chris                0.004
british              0.004
morgan               0.003
richard              0.003
dean                 0.003
wonderful            0.003
anthony              0.003


Topic 8 |---------------------

mother               0.020
son                  0.017
daughter             0.012
husband              0.012
friend               0.010
brother              0.010
wants                0.009
sister               0.008
finds                0.007
lives                0.007


Topic 9 |---------------------

children             0.043
child                0.026
boy                  0.024
age                  0.011
dog                  0.011
kid                  0.010
remember             0.010
adult                0.009
city                 0.008
today                0.008


Topic 10 |---------------------

space                0.015
fi                   0.015
sci                  0.015
earth                0.014
human                0.011
science              0.010
planet               0.010
future               0.009
bruce                0.008
alien                0.008


Topic 11 |---------------------

loved                0.013
liked                0.010
went                 0.008
remember             0.008
wanted               0.008
couldn               0.007
came                 0.007
felt                 0.007
enjoyed              0.006
understand           0.006


Topic 12 |---------------------

joe                  0.015
paul                 0.014
gay                  0.012
stewart              0.011
smith                0.011
michael              0.011
jim                  0.011
mary                 0.010
peter                0.008
robert               0.007


Topic 13 |---------------------

musical              0.020
tom                  0.018
dance                0.018
song                 0.017
dancing              0.013
allen                0.012
singing              0.012
joan                 0.010
arthur               0.010
cat                  0.010


Topic 14 |---------------------

killer               0.019
blood                0.012
gore                 0.012
murder               0.010
scary                0.008
police               0.008
creepy               0.008
kill                 0.007
killed               0.007
evil                 0.007


Topic 15 |---------------------

terrible             0.010
awful                0.010
dialogue             0.008
waste                0.008
worse                0.006
poorly               0.006
decent               0.005
totally              0.005
written              0.005
dull                 0.005


Topic 16 |---------------------

strong               0.005
dark                 0.005
viewer               0.005
style                0.004
thriller             0.004
noir                 0.004
human                0.004
direction            0.004
genre                0.004
drama                0.004


Topic 17 |---------------------

monster              0.015
zombie               0.012
italian              0.009
island               0.008
castle               0.008
trek                 0.008
ray                  0.008
creature             0.007
king                 0.007
giant                0.006


Topic 18 |---------------------

cinema               0.010
art                  0.009
japanese             0.009
style                0.006
violence             0.005
sound                0.005
experience           0.004
truly                0.004
tale                 0.004
japan                0.004


Topic 19 |---------------------

humor                0.014
romantic             0.008
recommend            0.007
definitely           0.006
laugh                0.006
perfect              0.006
wonderful            0.006
comedies             0.006
felt                 0.006
indian               0.005
In [102]:
vis = pyLDAvis.gensim.prepare(lda_model3, corpus, dictionary)
vis
/opt/conda/lib/python3.6/site-packages/pyLDAvis/_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))
Out[102]:
In [103]:
cm = CoherenceModel(model=lda_model3, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
Out[103]:
[-4.0765807711997537,
 -3.4307358901301019,
 -2.7738421794966417,
 -3.8817762868706147,
 -3.1853028422074328,
 -3.837813474368112,
 -3.1343350942591077,
 -4.6480626417874511,
 -2.7598184755426689,
 -4.4164862107081211,
 -3.3081295042528907,
 -2.6764678976781258,
 -3.9636387497952108,
 -5.5030534521145764,
 -2.7309481694233959,
 -2.7354202176372273,
 -2.7277660868966223,
 -3.8899421323489918,
 -3.2113636071059233,
 -3.0749731048333078]

50 topics

In [104]:
%time lda_model= gensim.models.ldamodel.LdaModel(corpus, num_topics=50, id2word=dictionary)
print_lda_model(lda_model, 50)
CPU times: user 1min 32s, sys: 1min 2s, total: 2min 34s
Wall time: 1min 9s
term                 frequency



Topic 0 |---------------------

earth                0.031
space                0.027
monster              0.020
science              0.019
planet               0.018
human                0.015
alien                0.014
computer             0.014
fiction              0.013
giant                0.013


Topic 1 |---------------------

written              0.028
acted                0.020
writing              0.015
jackie               0.013
badly                0.012
directed             0.012
chan                 0.012
festival             0.011
poorly               0.011
ryan                 0.011


Topic 2 |---------------------

charlie              0.043
studio               0.018
spectacular          0.015
navy                 0.015
karloff              0.014
jr                   0.013
donald               0.013
ralph                0.012
contract             0.012
sutherland           0.011


Topic 3 |---------------------

dog                  0.028
art                  0.021
released             0.018
release              0.017
video                0.010
available            0.009
future               0.009
hate                 0.008
redemption           0.008
uk                   0.008


Topic 4 |---------------------

violence             0.021
violent              0.016
nick                 0.012
sky                  0.011
heaven               0.009
rape                 0.009
giallo               0.007
intense              0.007
fbi                  0.007
eastwood             0.007


Topic 5 |---------------------

mystery              0.019
dark                 0.016
evil                 0.015
castle               0.012
atmosphere           0.010
mysterious           0.010
cult                 0.008
nancy                0.008
devil                0.008
jason                0.008


Topic 6 |---------------------

humor                0.045
stewart              0.034
queen                0.025
prince               0.024
batman               0.023
award                0.020
robin                0.020
amazing              0.019
hilarious            0.019
academy              0.015


Topic 7 |---------------------

video                0.030
rent                 0.021
waste                0.020
store                0.020
piece                0.016
crap                 0.014
buy                  0.012
remake               0.012
matthau              0.011
rented               0.011


Topic 8 |---------------------

viewer               0.009
style                0.008
noir                 0.006
drama                0.006
strong               0.005
works                0.005
narrative            0.005
visual               0.005
despite              0.005
dialogue             0.004


Topic 9 |---------------------

guys                 0.015
laugh                0.015
oh                   0.012
stone                0.010
club                 0.010
stop                 0.010
talk                 0.009
laughing             0.009
talking              0.008
hell                 0.008


Topic 10 |---------------------

british              0.030
bond                 0.015
today                0.014
grant                0.012
welles               0.011
richard              0.011
finest               0.011
public               0.010
america              0.009
african              0.008


Topic 11 |---------------------

animation            0.038
animated             0.027
edge                 0.025
cat                  0.024
fantasy              0.022
adventure            0.020
steve                0.018
anime                0.016
gordon               0.010
howard               0.010


Topic 12 |---------------------

gay                  0.038
ray                  0.027
lynch                0.017
straight             0.017
david                0.015
parker               0.014
oscar                0.013
wave                 0.012
deserved             0.011
refreshing           0.011


Topic 13 |---------------------

jack                 0.063
zombie               0.029
mad                  0.026
sequel               0.025
parody               0.015
columbo              0.015
max                  0.014
roger                0.011
hearted              0.010
beast                0.010


Topic 14 |---------------------

lee                  0.047
western              0.044
eddie                0.030
murphy               0.020
chase                0.018
mexican              0.016
driver               0.016
car                  0.016
chuck                0.015
oil                  0.014


Topic 15 |---------------------

awful                0.013
terrible             0.012
cool                 0.012
mean                 0.010
awesome              0.009
lines                0.009
horrible             0.009
worse                0.008
couldn               0.008
cheesy               0.007


Topic 16 |---------------------

police               0.027
murder               0.026
killer               0.017
kill                 0.015
killed               0.014
cop                  0.014
crime                0.014
car                  0.013
detective            0.013
friend               0.011


Topic 17 |---------------------

blood                0.024
gore                 0.021
killer               0.014
genre                0.012
nudity               0.012
fans                 0.012
slasher              0.011
flick                0.011
adam                 0.011
80                   0.010


Topic 18 |---------------------

musical              0.027
song                 0.027
dance                0.020
kelly                0.014
dancing              0.014
singing              0.014
charming             0.010
jones                0.010
delightful           0.009
number               0.009


Topic 19 |---------------------

mr                   0.053
paul                 0.032
smith                0.026
early                0.018
billy                0.018
ms                   0.015
career               0.015
ann                  0.014
taylor               0.013
mary                 0.012


Topic 20 |---------------------

town                 0.025
prison               0.015
christmas            0.015
gang                 0.013
law                  0.013
brother              0.011
hero                 0.010
mike                 0.009
boss                 0.008
gun                  0.008


Topic 21 |---------------------

scary                0.036
creepy               0.021
ghost                0.020
scared               0.014
witch                0.013
julie                0.010
spoiler              0.010
rights               0.009
scare                0.008
starts               0.008


Topic 22 |---------------------

game                 0.094
emma                 0.016
caine                0.014
sarah                0.013
video                0.013
voice                0.012
lloyd                0.012
sound                0.012
playing              0.011
print                0.011


Topic 23 |---------------------

girls                0.051
class                0.022
college              0.018
page                 0.017
teen                 0.014
student              0.013
drama                0.011
steven               0.010
teacher              0.010
carter               0.009


Topic 24 |---------------------

city                 0.037
rock                 0.023
band                 0.017
york                 0.017
rose                 0.009
red                  0.009
victoria             0.009
army                 0.008
maria                0.008
australian           0.007


Topic 25 |---------------------

documentary          0.030
footage              0.011
segment              0.007
subject              0.007
south                0.007
media                0.007
comment              0.006
art                  0.006
africa               0.006
actual               0.006


Topic 26 |---------------------

comic                0.036
frank                0.032
superb               0.027
fox                  0.025
chris                0.024
joan                 0.022
morgan               0.018
sinatra              0.018
victor               0.017
charles              0.014


Topic 27 |---------------------

children             0.039
mother               0.039
child                0.032
son                  0.032
daughter             0.020
husband              0.015
baby                 0.014
adult                0.012
age                  0.012
lives                0.010


Topic 28 |---------------------

george               0.041
robert               0.040
scott                0.034
allen                0.031
van                  0.030
ford                 0.022
fred                 0.019
horse                0.019
wayne                0.018
woody                0.017


Topic 29 |---------------------

italian              0.029
train                0.022
david                0.017
tarzan               0.015
hitler               0.014
nazi                 0.013
chaplin              0.010
wwii                 0.010
italy                0.010
jewish               0.009


Topic 30 |---------------------

stories              0.018
relationship         0.010
truly                0.009
powerful             0.007
visually             0.007
gives                0.007
tension              0.007
felt                 0.006
style                0.006
told                 0.006


Topic 31 |---------------------

cartoon              0.048
simon                0.029
eric                 0.027
remembered           0.023
satire               0.022
revolution           0.019
susan                0.018
mature               0.017
crafted              0.016
sea                  0.015


Topic 32 |---------------------

fi                   0.070
sci                  0.070
opera                0.036
wars                 0.028
drew                 0.024
cinderella           0.024
channel              0.023
brown                0.023
soap                 0.023
midnight             0.019


Topic 33 |---------------------

air                  0.014
room                 0.012
waiting              0.011
wait                 0.011
powell               0.010
daniel               0.009
baseball             0.008
mgm                  0.008
started              0.007
happen               0.007


Topic 34 |---------------------

chinese              0.039
christopher          0.030
london               0.029
kong                 0.029
hong                 0.026
delight              0.023
bette                0.022
china                0.019
universal            0.019
magic                0.018


Topic 35 |---------------------

loved                0.044
enjoyed              0.026
liked                0.024
remember             0.018
favorite             0.013
recommend            0.013
ago                  0.013
came                 0.012
perfect              0.011
definitely           0.010


Topic 36 |---------------------

rating               0.015
rated                0.013
amazing              0.012
remember             0.012
imdb                 0.011
underrated           0.010
copy                 0.008
price                0.008
ago                  0.008
highly               0.007


Topic 37 |---------------------

outstanding          0.031
jim                  0.030
bruce                0.026
jeff                 0.018
moore                0.017
news                 0.016
brian                0.015
judge                0.012
ted                  0.012
jake                 0.009


Topic 38 |---------------------

editing              0.020
shots                0.019
sound                0.016
island               0.013
captain              0.012
direction            0.011
cinematography       0.011
radio                0.011
cut                  0.008
desert               0.008


Topic 39 |---------------------

tom                  0.055
tony                 0.042
peter                0.030
jerry                0.026
dan                  0.019
bob                  0.017
robot                0.016
sullivan             0.015
fame                 0.015
leslie               0.012


Topic 40 |---------------------

joe                  0.039
dr                   0.034
doctor               0.031
wonderfully          0.020
marie                0.018
hospital             0.016
lewis                0.016
andy                 0.015
anderson             0.013
david                0.013


Topic 41 |---------------------

japanese             0.018
power                0.017
political            0.014
country              0.013
government           0.011
history              0.011
battle               0.010
russian              0.009
god                  0.009
freedom              0.009


Topic 42 |---------------------

french               0.063
disney               0.052
paris                0.026
uncle                0.025
beauty               0.023
anne                 0.017
voice                0.016
sean                 0.016
france               0.015
bourne               0.014


Topic 43 |---------------------

english              0.030
german               0.016
remarkable           0.012
silent               0.011
party                0.010
era                  0.010
language             0.010
stage                0.008
keaton               0.008
lucy                 0.007


Topic 44 |---------------------

episode              0.128
season               0.063
harry                0.044
television           0.025
kevin                0.022
kate                 0.021
program              0.017
week                 0.015
rob                  0.014
league               0.014


Topic 45 |---------------------

team                 0.041
boy                  0.041
match                0.039
white                0.026
kid                  0.026
arthur               0.024
jimmy                0.017
race                 0.017
win                  0.015
vs                   0.014


Topic 46 |---------------------

novel                0.031
jane                 0.022
romantic             0.019
henry                0.017
adaptation           0.014
books                0.011
based                0.009
period               0.009
comedies             0.008
humour               0.008


Topic 47 |---------------------

king                 0.044
michael              0.037
marriage             0.022
ed                   0.016
wedding              0.015
jackson              0.014
married              0.014
happiness            0.011
wood                 0.011
perfection           0.011


Topic 48 |---------------------

indian               0.027
al                   0.023
gangster             0.019
fu                   0.019
dennis               0.019
soft                 0.019
kung                 0.018
porn                 0.015
core                 0.015
india                0.015


Topic 49 |---------------------

human                0.021
lives                0.015
cinema               0.013
society              0.011
live                 0.010
nature               0.010
culture              0.009
journey              0.009
deep                 0.008
deeply               0.008
In [106]:
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis
/opt/conda/lib/python3.6/site-packages/pyLDAvis/_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))
Out[106]:
In [107]:
cm = CoherenceModel(model=lda_model, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
Out[107]:
[-3.1959912642516017,
 -3.7864467355157498,
 -5.2877172499689031,
 -4.6570527288208909,
 -4.3340095336978948,
 -4.08800263435191,
 -4.0684929158484193,
 -4.0219349012934078,
 -2.779889969575227,
 -2.8909820164003528,
 -3.8822952188402935,
 -4.0556834435572213,
 -4.6651730347215095,
 -6.083052719327001,
 -5.7440020163729333,
 -2.7290093530143955,
 -2.7822834462031039,
 -3.807187709969083,
 -3.3463411442814204,
 -4.6910487284751792,
 -3.4077741749402022,
 -4.232950004304123,
 -5.6702039311351777,
 -4.8511234736227244,
 -4.2299260183421472,
 -3.7182477025248066,
 -5.8453044921725308,
 -3.1236199004038085,
 -4.7705036950291371,
 -5.3370308169071752,
 -3.1690870133629443,
 -5.5233060905314781,
 -7.1934873155874941,
 -3.7135921491687984,
 -7.983885333780175,
 -2.8804588736167349,
 -3.2524308829669262,
 -5.4136674682478336,
 -3.6440072877477285,
 -5.6332693776028275,
 -4.3824939067771478,
 -3.621320936875791,
 -5.5165631365498458,
 -4.407471162878088,
 -4.3545338806656071,
 -5.1993310285014607,
 -3.8821827154224544,
 -4.5271957974075363,
 -5.7891195172519572,
 -3.2479417700105304]

Limite de Reviews por Película

Un primer filtro que podemos aplicar para evitar bias es un límite sobre el número de revies para una misma película. Usaremos un parámetro configurable con un valor inicial de 10 después de estudiar la primera gráfica

In [35]:
ids_by_path = {}
for urls_file in walk_corpus('./resources/aclImdb/all/', 'urls.urls'):
    dirname = os.path.dirname(urls_file)
    with open(urls_file) as f:
            ids_map = {}
            lines = f.readlines()
            for index, line in enumerate(lines):
                movie_id = id_pattern.search(line).group(1)
                ids_map[index] = movie_id
            ids_by_path[dirname] = ids_map
            
line_id_pattern = re.compile('([0-9]+)_[0-9]+')
In [36]:
def tokenize_corpus(path, pattern, min_df=1, mode='d', limit=10):
    movie_counter = Counter()

    for corpus_file in walk_corpus(path, pattern):
        dirname = os.path.dirname(corpus_file)
        line_id = int(line_id_pattern.search(corpus_file).group(1))
        ids_map = ids_by_path[dirname]
        movie_id = ids_map[line_id]
        if movie_counter[movie_id] <= limit:
            movie_counter[movie_id] += 1
            with open(corpus_file, 'r') as next_file:
                next_review = next_file.read()
                tokens = analyzer(next_review)
                if mode == 'd':
                    yield tokens
                else:
                    for token in tokens:
                        yield token
In [10]:
%time dist2 = FreqDist(tokenize_corpus('./resources/aclImdb/all/', '*.txt', mode='t'))
print(dist2)
pp.pprint(dist2.most_common(100))
CPU times: user 28.4 s, sys: 16.1 s, total: 44.5 s
Wall time: 3min 38s
<FreqDist with 123728 samples and 7949267 outcomes>
[('br', 288630), ('movie', 122232), ('film', 112554), ('like', 58209), ('just', 51314),
 ('good', 41899), ('time', 35474), ('really', 32889), ('story', 32581), ('bad', 28692),
 ('people', 25751), ('don', 25288), ('great', 23732), ('make', 23451), ('way', 22204),
 ('movies', 21570), ('characters', 20484), ('character', 19761), ('films', 19659), ('think', 19640),
 ('watch', 19592), ('plot', 19382), ('acting', 18633), ('seen', 18399), ('little', 18011),
 ('did', 17771), ('know', 17664), ('life', 17515), ('love', 17289), ('better', 16980),
 ('best', 16617), ('does', 16488), ('man', 16318), ('end', 15917), ('scene', 15616),
 ('scenes', 15126), ('say', 14884), ('ve', 14401), ('real', 13451), ('thing', 13275),
 ('watching', 13227), ('doesn', 13076), ('didn', 12945), ('director', 12929), ('actors', 12838),
 ('old', 12596), ('funny', 12504), ('actually', 12413), ('years', 12339), ('work', 12030),
 ('going', 11931), ('look', 11888), ('10', 11647), ('new', 11628), ('makes', 11568), ('lot', 11471),
 ('pretty', 10733), ('want', 10526), ('cast', 10502), ('things', 10340), ('quite', 10282),
 ('world', 10257), ('fact', 10184), ('young', 10163), ('long', 10058), ('got', 9992),
 ('series', 9916), ('horror', 9857), ('big', 9733), ('action', 9713), ('thought', 9559),
 ('interesting', 9384), ('comedy', 9356), ('guy', 9326), ('isn', 9263), ('right', 9091),
 ('script', 9036), ('minutes', 9036), ('gets', 8990), ('come', 8883), ('point', 8878),
 ('music', 8825), ('saw', 8819), ('original', 8729), ('role', 8693), ('times', 8682), ('tv', 8617),
 ('far', 8573), ('bit', 8553), ('worst', 8357), ('making', 8307), ('ll', 8164), ('girl', 8135),
 ('family', 8100), ('feel', 8028), ('probably', 8007), ('away', 7975), ('kind', 7952),
 ('woman', 7871), ('hard', 7788)]
In [4]:
%time dictionary2 = gensim.corpora.Dictionary(tokenize_corpus('./resources/aclImdb/all/', '*.txt'))
data = [[dictionary2.num_docs, dictionary2.num_pos, len(dictionary2.token2id)]]
df = pd.DataFrame(data)
df.columns=['Numero de reviews analizadas', 'Numero de tokens analizados', 'Numero de tokens únicos actuales']
df.head()
Out[4]:
Numero de reviews analizadas Numero de tokens analizados Numero de tokens únicos actuales
0 72310 7949267 123728
In [83]:
dictionary2.save('limited.dict')

Aplicando normalización y filtrado de vocabulario

In [17]:
dictionary_filter_most_frequent(dictionary2, dist2)
dictionary2.filter_extremes(no_below=10, no_above=0.5)
plurals = normalize_dictionary(dictionary2)
dictionary2.save('limited.normalized.dict')
corpus = MovieCorpus("./resources/aclImdb/all", dictionary2)
gensim.corpora.MmCorpus.serialize('corpus4.mm', corpus)
corpus = gensim.corpora.MmCorpus('corpus4.mm')
%time lda_model4 = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary2)
print_lda_model(lda_model4)
term                 frequency



Topic 0 |---------------------

animation            0.006
fu                   0.005
disney               0.005
kung                 0.005
fight                0.005
loved                0.004
chinese              0.004
recommend            0.004
enjoy                0.004
wish                 0.004


Topic 1 |---------------------

police               0.005
murder               0.005
killed               0.004
prison               0.004
hero                 0.004
battle               0.004
kill                 0.004
crime                0.003
japanese             0.003
army                 0.003


Topic 2 |---------------------

fi                   0.012
sci                  0.011
pilot                0.008
adam                 0.007
air                  0.005
alien                0.005
release              0.004
portuguese           0.004
robot                0.004
plane                0.004


Topic 3 |---------------------

charlie              0.007
girls                0.005
slasher              0.004
matt                 0.003
nudity               0.003
hot                  0.003
rooney               0.002
steve                0.002
genre                0.002
massacre             0.002


Topic 4 |---------------------

match                0.013
vs                   0.006
jim                  0.005
castle               0.005
fans                 0.004
concert              0.004
ring                 0.004
sean                 0.004
von                  0.003
party                0.003


Topic 5 |---------------------

children             0.009
documentary          0.008
humor                0.005
political            0.005
understand           0.004
self                 0.004
today                0.004
person               0.004
child                0.004
society              0.004


Topic 6 |---------------------

game                 0.027
remember             0.010
bond                 0.007
columbo              0.007
guys                 0.006
fight                0.005
came                 0.005
marie                0.005
buy                  0.004
playing              0.004


Topic 7 |---------------------

season               0.006
laugh                0.005
episodes             0.005
stuff                0.004
lines                0.004
horrible             0.004
annoying             0.003
ok                   0.003
cheesy               0.003
couldn               0.003


Topic 8 |---------------------

jane                 0.012
romantic             0.010
song                 0.007
anime                0.006
indian               0.006
romance              0.006
opera                0.006
flynn                0.005
tarzan               0.004
glover               0.004


Topic 9 |---------------------

space                0.012
earth                0.008
planet               0.007
island               0.006
monster              0.006
science              0.005
crew                 0.005
trek                 0.005
blood                0.005
van                  0.004


Topic 10 |---------------------

police               0.004
alex                 0.004
hickock              0.003
murder               0.003
ken                  0.003
woo                  0.003
captivating          0.003
frenchman            0.002
caine                0.002
law                  0.002


Topic 11 |---------------------

batman               0.007
master               0.006
chan                 0.006
anderson             0.005
french               0.004
superman             0.004
russell              0.004
cop                  0.003
puppet               0.003
hung                 0.003


Topic 12 |---------------------

town                 0.011
western              0.010
joe                  0.007
rock                 0.007
band                 0.005
footage              0.005
serial               0.004
documentary          0.004
san                  0.004
killer               0.003


Topic 13 |---------------------

cinema               0.007
atmosphere           0.005
genre                0.004
quality              0.004
art                  0.004
cinematography       0.003
style                0.003
sound                0.003
definitely           0.003
dialogue             0.003


Topic 14 |---------------------

performances         0.004
lives                0.004
relationship         0.004
son                  0.004
wonderful            0.003
mother               0.003
beautifully          0.003
french               0.003
drama                0.003
experience           0.003


Topic 15 |---------------------

history              0.006
gay                  0.006
novel                0.005
television           0.004
heart                0.003
century              0.003
based                0.003
wonderful            0.003
country              0.003
stories              0.003


Topic 16 |---------------------

david                0.007
george               0.006
james                0.006
mr                   0.006
richard              0.006
mary                 0.005
king                 0.005
garbo                0.005
peter                0.005
jack                 0.005


Topic 17 |---------------------

wonderful            0.007
musical              0.006
oscar                0.005
perfect              0.005
singing              0.004
dance                0.004
actress              0.004
kelly                0.004
performances         0.004
song                 0.004


Topic 18 |---------------------

boy                  0.010
mother               0.009
girls                0.008
sister               0.006
child                0.006
powell               0.006
daughter             0.005
baby                 0.005
children             0.004
ghost                0.004


Topic 19 |---------------------

tom                  0.010
harry                0.007
joan                 0.007
barbara              0.006
sam                  0.005
bruce                0.005
jerry                0.005
musical              0.005
fox                  0.004
ray                  0.004
In [19]:
vis = pyLDAvis.gensim.prepare(lda_model4, corpus, dictionary2)
vis
/opt/conda/lib/python3.6/site-packages/pyLDAvis/_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))
Out[19]:
In [22]:
cm = CoherenceModel(model=lda_model4, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
Out[22]:
[-3.4736747072019507,
 -2.8836454636056046,
 -6.4974293614709877,
 -4.0944689230022089,
 -6.9321725144710804,
 -2.8213212081478165,
 -3.4248214782250987,
 -2.8335608279688715,
 -4.7652011069303937,
 -4.1914703393288022,
 -7.5931346022033264,
 -6.4087057457649843,
 -4.2137898254407338,
 -2.8032934629011583,
 -2.7342576184178395,
 -2.9198050705840188,
 -3.2274850244245141,
 -3.1154570665398076,
 -3.262508068311154,
 -3.9860395406851143]
In [38]:
#dictionary_keep_n_frequent(dictionary2, dist2)
corpus = MovieCorpus("./resources/aclImdb/all", dictionary2)
gensim.corpora.MmCorpus.serialize('corpus5.mm', corpus)
corpus = gensim.corpora.MmCorpus('corpus5.mm')
%time lda_model5 = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary2)
lda_model5.save('limited.normalized.filtered.model')
print_lda_model(lda_model5)
CPU times: user 41.9 s, sys: 0 ns, total: 41.9 s
Wall time: 42.6 s
term                 frequency



Topic 0 |---------------------

mr                   0.016
king                 0.011
wonderful            0.011
novel                0.010
jane                 0.008
mary                 0.007
television           0.007
richard              0.007
remember             0.007
david                0.006


Topic 1 |---------------------

gang                 0.009
summer               0.007
ring                 0.006
white                0.006
african              0.006
delight              0.005
australian           0.005
buddy                0.005
candy                0.005
remembered           0.005


Topic 2 |---------------------

game                 0.041
town                 0.015
lee                  0.012
fight                0.010
jack                 0.010
western              0.009
van                  0.008
martial              0.008
charlie              0.008
tony                 0.008


Topic 3 |---------------------

children             0.009
child                0.006
match                0.006
remember             0.006
live                 0.006
boy                  0.005
kid                  0.005
loved                0.005
age                  0.004
indian               0.004


Topic 4 |---------------------

cool                 0.013
space                0.013
fi                   0.013
sci                  0.012
earth                0.012
guys                 0.010
awesome              0.010
planet               0.009
oh                   0.009
alien                0.008


Topic 5 |---------------------

mother               0.011
son                  0.008
relationship         0.007
lives                0.007
husband              0.007
daughter             0.006
sister               0.005
boy                  0.005
married              0.004
child                0.004


Topic 6 |---------------------

french               0.022
british              0.017
german               0.016
english              0.015
garbo                0.011
powell               0.010
silent               0.010
spanish              0.008
anna                 0.007
france               0.006


Topic 7 |---------------------

japanese             0.010
understand           0.006
reality              0.005
season               0.005
human                0.005
message              0.004
japan                0.004
century              0.004
political            0.004
history              0.004


Topic 8 |---------------------

fu                   0.012
kung                 0.011
chinese              0.011
violence             0.010
harris               0.007
disturbing           0.007
china                0.005
maria                0.005
von                  0.005
truly                0.005


Topic 9 |---------------------

dr                   0.018
doctor               0.016
castle               0.012
monster              0.008
wood                 0.008
ed                   0.007
science              0.007
mad                  0.006
professor            0.006
cook                 0.006


Topic 10 |---------------------

history              0.012
battle               0.008
army                 0.008
west                 0.007
country              0.007
military             0.007
island               0.006
america              0.006
pilot                0.006
president            0.005


Topic 11 |---------------------

gay                  0.024
david                0.011
sexual               0.010
witch                0.009
soft                 0.009
porn                 0.009
russian              0.009
lynch                0.007
rape                 0.007
lesbian              0.006


Topic 12 |---------------------

musical              0.013
song                 0.011
stage                0.011
dance                0.010
singing              0.010
tom                  0.009
dancing              0.007
jerry                0.007
studio               0.006
stars                0.006


Topic 13 |---------------------

murder               0.024
police               0.021
killer               0.015
crime                0.013
cop                  0.011
bond                 0.011
detective            0.010
killed               0.009
mystery              0.008
thriller             0.008


Topic 14 |---------------------

sam                  0.005
joan                 0.004
nick                 0.004
early                0.004
giallo               0.004
henry                0.004
marie                0.004
michael              0.004
picture              0.004
score                0.004


Topic 15 |---------------------

documentary          0.007
art                  0.006
cinema               0.006
style                0.005
performances         0.005
drama                0.004
experience           0.004
direction            0.004
viewer               0.004
wonderful            0.004


Topic 16 |---------------------

columbo              0.008
disappointed         0.006
waste                0.006
episodes             0.005
hour                 0.005
couldn               0.005
worse                0.005
went                 0.005
wish                 0.005
looked               0.005


Topic 17 |---------------------

blood                0.009
evil                 0.007
gore                 0.005
genre                0.005
atmosphere           0.004
sequence             0.004
chan                 0.004
fans                 0.004
kong                 0.004
creepy               0.003


Topic 18 |---------------------

flick                0.007
horrible             0.007
dialogue             0.006
crap                 0.006
mean                 0.005
totally              0.005
girls                0.005
imdb                 0.005
waste                0.005
title                0.005


Topic 19 |---------------------

animation            0.020
cartoon              0.017
disney               0.014
animated             0.014
humor                0.011
episodes             0.011
anime                0.011
voice                0.010
dog                  0.010
batman               0.008
In [40]:
vis = pyLDAvis.gensim.prepare(lda_model5, corpus, dictionary2)
vis
/opt/conda/lib/python3.6/site-packages/pyLDAvis/_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))
Out[40]:
In [41]:
cm = CoherenceModel(model=lda_model5, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
Out[41]:
[-3.1017010906431257,
 -5.3863781101563113,
 -3.4860329584458936,
 -2.7373220621914816,
 -3.1839373770756687,
 -2.6773250147031082,
 -5.3382598246227735,
 -3.1953150397382051,
 -6.3205596968557627,
 -4.8392203124523254,
 -3.3690709920057973,
 -3.9141097777850793,
 -3.4078978572067,
 -3.2075941275445472,
 -3.4974114609353761,
 -2.783888093090638,
 -3.245974438692937,
 -2.9838146872888958,
 -3.1845560764441534,
 -3.3340415011957463]

Filtrando Vocabulario Polarizado

En todos los modelos anteriores hemos visto que existen varios tipos de palabras que concurrente aparecen con bastante frecuencia en varios tópicos, pero que aportan escaso valor a la hora de categorizar por temáticas. Algunos ejemplos de estos tipos de palabras son nombres propios y verbos que podemos filtrar

In [42]:
def dictionary_filter_neutral(dictionary, polarity=0.5):
    from textblob import TextBlob
    neutrals = []
    for token in dictionary.values():
        if len(token) > 1:
            upper = token[0].upper() + token[1:]
        blob = TextBlob(upper)
        if abs(blob.polarity) <= polarity and blob.pos_tags[0][1] != 'NNP' and not blob.pos_tags[0][1].startswith('VB'):
            neutrals.append(dictionary.token2id[token])
                    
    dictionary.filter_tokens(good_ids=neutrals)
    dictionary.compactify()
    return neutrals
In [43]:
dictionary = gensim.corpora.Dictionary.load('normalized.v1.dict')
print('Número de palabras iniciales: {}'.format(len(dictionary)))
neutrals = dictionary_filter_neutral(dictionary, 0.0)
print("Número de palabras neutrales: {}".format(len(neutrals)))
Número de palabras iniciales: 30425
Número de palabras neutrales: 21623
In [49]:
dictionary_keep_n_frequent(dictionary, dist)
corpus = MovieCorpus("./resources/aclImdb/all", dictionary)
gensim.corpora.MmCorpus.serialize('corpus6.mm', corpus)
corpus = gensim.corpora.MmCorpus('corpus6.mm')
%time lda_model_n = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word=dictionary)
print_lda_model(lda_model_n)
term                 frequency



Topic 0 |---------------------

japanese             0.017
history              0.017
political            0.012
country              0.011
chinese              0.009
power                0.009
government           0.009
indian               0.009
americans            0.008
fight                0.008


Topic 1 |---------------------

police               0.019
murder               0.017
town                 0.017
crime                0.014
noir                 0.010
prison               0.010
scott                0.009
mr                   0.008
cop                  0.008
finds                0.007


Topic 2 |---------------------

match                0.017
piece                0.011
style                0.011
art                  0.010
italian              0.010
russian              0.008
genre                0.008
segment              0.008
masterpiece          0.007
truly                0.007


Topic 3 |---------------------

documentary          0.011
understand           0.011
person               0.010
lives                0.010
society              0.009
message              0.008
self                 0.007
reality              0.007
experience           0.007
change               0.006


Topic 4 |---------------------

dialogue             0.014
humour               0.012
opera                0.010
lines                0.010
camp                 0.009
class                0.008
humor                0.007
emma                 0.007
mr                   0.006
definitely           0.006


Topic 5 |---------------------

girls                0.021
guys                 0.018
car                  0.011
martial              0.009
fox                  0.009
matt                 0.007
baseball             0.007
friend               0.007
danny                0.007
notch                0.006


Topic 6 |---------------------

children             0.033
child                0.029
mother               0.026
son                  0.018
brother              0.013
daughter             0.009
baby                 0.009
barbara              0.008
nick                 0.007
uncle                0.007


Topic 7 |---------------------

fi                   0.019
sci                  0.019
earth                0.010
marie                0.009
ship                 0.009
crew                 0.009
captain              0.008
future               0.008
planet               0.007
hero                 0.007


Topic 8 |---------------------

animation            0.035
disney               0.029
voice                0.021
cartoon              0.019
prince               0.015
cat                  0.015
bond                 0.014
jerry                0.013
robin                0.012
adventure            0.011


Topic 9 |---------------------

episode              0.059
season               0.028
space                0.019
footage              0.013
television           0.013
jason                0.011
mike                 0.010
pilot                0.009
trek                 0.009
team                 0.008


Topic 10 |---------------------

oscar                0.018
actress              0.014
song                 0.013
direction            0.012
cinema               0.011
romantic             0.011
thriller             0.010
picture              0.009
award                0.009
drama                0.009


Topic 11 |---------------------

novel                0.016
british              0.011
german               0.009
joe                  0.008
century              0.008
period               0.007
era                  0.007
adaptation           0.007
jane                 0.007
henry                0.006


Topic 12 |---------------------

christmas            0.016
al                   0.012
queen                0.011
tarzan               0.009
flynn                0.009
ms                   0.008
pacino               0.007
mr                   0.007
hotel                0.006
lion                 0.006


Topic 13 |---------------------

felt                 0.007
viewer               0.006
attention            0.005
eye                  0.005
review               0.005
moment               0.005
word                 0.004
imdb                 0.004
gem                  0.004
com                  0.004


Topic 14 |---------------------

video                0.030
rent                 0.010
copy                 0.010
store                0.010
came                 0.009
release              0.009
ago                  0.009
80                   0.008
fans                 0.006
flick                0.006


Topic 15 |---------------------

musical              0.022
tony                 0.016
rock                 0.016
dance                0.015
kelly                0.014
humor                0.011
ray                  0.011
stage                0.010
billy                0.009
band                 0.009


Topic 16 |---------------------

relationship         0.013
husband              0.012
heart                0.010
marriage             0.008
joan                 0.007
beauty               0.007
grant                0.007
drama                0.007
gives                0.007
emotional            0.006


Topic 17 |---------------------

blood                0.017
gore                 0.014
monster              0.011
zombie               0.009
dr                   0.009
flick                0.008
fans                 0.008
genre                0.008
nudity               0.008
violence             0.008


Topic 18 |---------------------

van                  0.011
adam                 0.009
powell               0.008
fight                0.008
jackie               0.008
chan                 0.007
che                  0.007
face                 0.006
head                 0.006
baby                 0.006


Topic 19 |---------------------

ed                   0.015
sequel               0.013
master               0.013
silent               0.012
hitler               0.012
cage                 0.011
trilogy              0.011
oliver               0.010
dean                 0.010
wood                 0.010
In [153]:
lda_model_n.save('neutral.model')
In [59]:
vis = pyLDAvis.gensim.prepare(lda_model_n, corpus, dictionary)
vis
/opt/conda/lib/python3.6/site-packages/pyLDAvis/_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))
Out[59]:
In [51]:
cm = CoherenceModel(model=lda_model5, corpus=corpus, coherence='u_mass')
cm.get_coherence_per_topic()
Out[51]:
[-10.051165118990651,
 -8.8006380216094762,
 -9.9431637351849318,
 -7.0482833682776844,
 -8.7008648261085781,
 -5.9859593295762243,
 -8.6334704984888369,
 -7.7937311402327136,
 -12.979005149964866,
 -8.4726218042595178,
 -9.5923400322119523,
 -8.8908398039530834,
 -7.524473795226637,
 -11.759763967649146,
 -9.4658877785714015,
 -9.1637515440182611,
 -6.6116748263215745,
 -6.3548996153188346,
 -5.9521530827744735,
 -7.7463274281963574]

Probando otros Modelos de Representación: TF-IDF

In [99]:
dictionary = gensim.corpora.Dictionary.load('normalized.v1.dict')
dictionary_keep_n_frequent(dictionary, dist)
corpus = MovieCorpus("./resources/aclImdb/all", dictionary)
tfidf = gensim.models.TfidfModel(corpus)
%time lda_model6 = gensim.models.ldamodel.LdaModel(tfidf[corpus], num_topics=20, id2word=dictionary)
pp.pprint(lda_model6.print_topics(20))
CPU times: user 2min, sys: 31.9 s, total: 2min 32s
Wall time: 8min 38s
[(0,
  '0.005*"parker" + 0.005*"buster" + 0.005*"tour" + 0.005*"keaton" + 0.004*"grief" + '
  '0.004*"spring" + 0.003*"alexander" + 0.003*"bogart" + 0.003*"game" + 0.003*"ali"'),
 (1,
  '0.004*"terrible" + 0.004*"video" + 0.004*"waste" + 0.004*"awful" + 0.003*"horrible" + '
  '0.003*"totally" + 0.003*"rent" + 0.003*"episode" + 0.003*"recommend" + 0.003*"worse"'),
 (2,
  '0.048*"giallo" + 0.030*"catholic" + 0.027*"le" + 0.027*"sean" + 0.019*"penn" + 0.017*"spike" + '
  '0.017*"bruno" + 0.017*"altman" + 0.017*"heartfelt" + 0.016*"addiction"'),
 (3,
  '0.004*"french" + 0.004*"documentary" + 0.004*"wonderful" + 0.004*"amazing" + 0.003*"superb" + '
  '0.003*"japanese" + 0.003*"anime" + 0.003*"history" + 0.003*"garbo" + 0.003*"european"'),
 (4,
  '0.032*"match" + 0.024*"royal" + 0.021*"richardson" + 0.018*"ian" + 0.018*"louise" + '
  '0.017*"agency" + 0.016*"redemption" + 0.015*"fonda" + 0.015*"sport" + 0.015*"capturing"'),
 (5,
  '0.014*"turkish" + 0.011*"fabulous" + 0.009*"streisand" + 0.008*"poem" + 0.008*"divine" + '
  '0.008*"nick" + 0.007*"grant" + 0.007*"purchased" + 0.007*"goldberg" + 0.007*"maggie"'),
 (6,
  '0.013*"wonderfully" + 0.010*"complaint" + 0.009*"sherlock" + 0.008*"peck" + 0.008*"immensely" + '
  '0.008*"ralph" + 0.008*"tracy" + 0.006*"kicking" + 0.006*"ward" + 0.006*"karloff"'),
 (7,
  '0.017*"kong" + 0.017*"hong" + 0.016*"chan" + 0.009*"jackie" + 0.008*"remind" + 0.007*"delight" '
  '+ 0.007*"nightmare" + 0.007*"explicit" + 0.007*"woody" + 0.007*"wonderfully"'),
 (8,
  '0.021*"powell" + 0.017*"disney" + 0.016*"animation" + 0.013*"cartoon" + 0.013*"batman" + '
  '0.011*"extraordinary" + 0.009*"judy" + 0.009*"animated" + 0.007*"glimpse" + 0.007*"germany"'),
 (9,
  '0.005*"fu" + 0.004*"kung" + 0.003*"martial" + 0.003*"tom" + 0.003*"king" + 0.003*"fight" + '
  '0.003*"prince" + 0.003*"hitler" + 0.003*"oscar" + 0.003*"award"'),
 (10,
  '0.009*"arthur" + 0.008*"bond" + 0.007*"hardy" + 0.007*"marie" + 0.005*"stan" + 0.005*"friendly" '
  '+ 0.005*"superman" + 0.004*"timeless" + 0.004*"lloyd" + 0.004*"mitchum"'),
 (11,
  '0.031*"bette" + 0.019*"christine" + 0.017*"boston" + 0.017*"christmas" + 0.015*"suggested" + '
  '0.015*"russell" + 0.014*"slice" + 0.014*"kurt" + 0.013*"matthau" + 0.013*"et"'),
 (12,
  '0.003*"silent" + 0.003*"era" + 0.003*"greek" + 0.003*"flynn" + 0.002*"sinatra" + '
  '0.002*"documentary" + 0.002*"british" + 0.002*"spy" + 0.002*"gem" + 0.002*"australia"'),
 (13,
  '0.087*"columbo" + 0.045*"bollywood" + 0.033*"classical" + 0.030*"indian" + 0.027*"india" + '
  '0.025*"khan" + 0.023*"shine" + 0.021*"muslim" + 0.015*"delight" + 0.015*"24"'),
 (14,
  '0.002*"episode" + 0.002*"game" + 0.002*"boy" + 0.002*"town" + 0.002*"remember" + 0.002*"small" '
  '+ 0.002*"lives" + 0.001*"mother" + 0.001*"girls" + 0.001*"picture"'),
 (15,
  '0.006*"jimmy" + 0.006*"notch" + 0.006*"stanwyck" + 0.006*"harris" + 0.005*"cook" + '
  '0.005*"barbara" + 0.005*"poignant" + 0.005*"mary" + 0.005*"susan" + 0.005*"douglas"'),
 (16,
  '0.029*"mario" + 0.023*"chaplin" + 0.023*"walt" + 0.018*"miike" + 0.018*"homeless" + '
  '0.017*"restored" + 0.016*"shirley" + 0.016*"astounding" + 0.015*"tarzan" + 0.015*"caine"'),
 (17,
  '0.037*"curly" + 0.027*"jazz" + 0.024*"swedish" + 0.024*"pleasantly" + 0.022*"shorts" + '
  '0.020*"clint" + 0.020*"werewolf" + 0.019*"peak" + 0.018*"wayne" + 0.018*"eastwood"'),
 (18,
  '0.002*"enjoyed" + 0.002*"realistic" + 0.002*"viewer" + 0.002*"glover" + 0.002*"loved" + '
  '0.002*"drama" + 0.002*"festival" + 0.002*"lives" + 0.002*"recommend" + 0.002*"experience"'),
 (19,
  '0.009*"cooper" + 0.008*"documentary" + 0.008*"warming" + 0.007*"unforgettable" + 0.006*"trek" + '
  '0.006*"fiction" + 0.006*"kirk" + 0.005*"revolution" + 0.005*"portrait" + 0.005*"format"')]

Usando Nuestro Modelo como Profiler

In [178]:
import requests
import json
r = requests.get("http://www.omdbapi.com/?i=tt0379889&apikey=ccedfaeb")
pp.pprint(r.json())
{'Actors': 'Al Pacino, Jeremy Irons, Joseph Fiennes, Lynn Collins',
 'Awards': 'Nominated for 1 BAFTA Film Award. Another 2 wins & 6 nominations.',
 'BoxOffice': '$3,300,000',
 'Country': 'USA, Italy, Luxembourg, UK',
 'DVD': '10 May 2005',
 'Director': 'Michael Radford',
 'Genre': 'Drama, Romance',
 'Language': 'English',
 'Metascore': '63',
 'Plot': 'In 16th century Venice, when a merchant must default on a large loan from an abused '
         'Jewish moneylender for a friend with romantic ambitions, the bitterly vengeful creditor '
         'demands a gruesome payment instead.',
 'Poster': 'https://m.media-amazon.com/images/M/MV5BMGJiNGUxZGYtM2U2YS00ZjJlLThlNjQtYTVkNWUxZGRmYTk4XkEyXkFqcGdeQXVyMTMxMTY0OTQ@._V1_SX300.jpg',
 'Production': 'Sony Pictures Classics',
 'Rated': 'R',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '7.1/10'},
             {'Source': 'Rotten Tomatoes', 'Value': '71%'},
             {'Source': 'Metacritic', 'Value': '63/100'}],
 'Released': '18 Feb 2005',
 'Response': 'True',
 'Runtime': '131 min',
 'Title': 'The Merchant of Venice',
 'Type': 'movie',
 'Website': 'http://www.sonypictures.com/classics/merchantofvenice/',
 'Writer': 'William Shakespeare (play), Michael Radford (screenplay)',
 'Year': '2004',
 'imdbID': 'tt0379889',
 'imdbRating': '7.1',
 'imdbVotes': '32,219'}

Analizando una Review Positiva

Para analizar las reviews, primero vamos a tokenizar el texto y lo vamos a convertir en una representación Bag of Words con proyección a nuestro diccionario. Esta representación es la que podemos pasar a nuestro modelo LDA para que nos devuelva la distribución de tópicos más probable sobre nuestro texto inicial

In [179]:
good_review_text = """I just saw this at the Toronto International Film Festival in the beautiful Elgin Theatre. 
I was blown away by the beautiful cinematography, the brilliant adaptation of a very tricky play and last 
but not least, the bravura performance of Al Pacino, who was born to play this role, 
which was perfectly balanced by an equally strong performance from Jeremy Irons.<br /><br />
The film deftly explores the themes of love vs loyalty, law vs justice, and passion vs reason. 
Some might protest that the content is inherently anti-semitic, 
however they should consider the historical context of the story, 
and the delicate and nuanced way in which it is told in this adaptation"""
good_review_tokens = analyzer(good_review_text)
lda_model_n.get_document_topics(dictionary.doc2bow(good_review_tokens))
Out[179]:
[(2, 0.20823306),
 (4, 0.077101797),
 (7, 0.10016536),
 (10, 0.067645445),
 (11, 0.40421197),
 (12, 0.11347574)]

Comprobamos cuales son los 10 tokens más prominentes del tópico asignado con más probabilidad, el tópico 11

In [110]:
def get_topic_tokens(model, topic_id):
    terms = model.show_topic(topic_id)
    return [item[0] for item in terms]
tokens = get_topic_tokens(lda_model_n, 11)
tokens
Out[110]:
['novel',
 'british',
 'german',
 'joe',
 'century',
 'period',
 'era',
 'adaptation',
 'jane',
 'henry']
In [181]:
shared_tokens = list(set(good_review_tokens) & set(tokens))
Out[181]:
['adaptation']

Análisis de Sentimiento sobre los Keywords de los Tópicos

In [93]:
pd.options.display.max_colwidth = -1
from IPython.display import display, HTML

def explore_opinions(text, keywords):
    from textblob import TextBlob
    blob = TextBlob(text)
    data = []
    for sentence in blob.sentences:
        for token in keywords:
            if token in sentence.words:
                data.append([token, sentence.__str__(), sentence.sentiment[0], sentence.sentiment[1]])
                
    df = pd.DataFrame(data)
    df.columns = ['Token', 'Sentence', 'Sentiment Polarity', 'Sentiment Subjectivity']
    return df
        
display(HTML(explore_opinions(good_review_text, shared_tokens).to_html().replace("\\n","<br>").replace('adaptation', '<strong>adaptation</strong>')))
Token Sentence Sentiment Polarity Sentiment Subjectivity
0 adaptation I was blown away by the beautiful cinematography, the brilliant adaptation of a very tricky play and last
but not least, the bravura performance of Al Pacino, who was born to play this role,
which was perfectly balanced by an equally strong performance from Jeremy Irons.<br /><br />
The film deftly explores the themes of love vs loyalty, law vs justice, and passion vs reason.
0.514815 0.666667
1 adaptation Some might protest that the content is inherently anti-semitic,
however they should consider the historical context of the story,
and the delicate and nuanced way in which it is told in this adaptation
-0.150000 0.450000

Analizando una Review Negativa

In [97]:
bad_review_text = """I have to admit that although I'm a fan of Shakespeare, 
I was never really familiar with this play. And what I really can't say is whether this is a poor adaptation, 
or whether the play is just a bad choice for film. 
There are some nice pieces of business in it, but the execution is very clunky and the plot is obvious. 
The theme of the play is on the nature of debt, using the financial idea of debt and justice as a 
metaphor for emotional questions. That becomes clear when the issue of the rings becomes more important than 
the business with Shylock, which unfortunately descends into garden variety anti-Semitisim despite 
the Bard's best attempts to salvage him with a couple nice monologues.<br /><br />
Outside of Jeremy Irons' dignified turn, I didn't think there was a decent performance in the bunch. 
Pacino's Yiddish consists of a slight whine added to the end of every pronouncement, and 
some of the better Shylock scenes are reduced to variations on the standard "Pacino gets angry" 
scene that his fans know and love. But Lynn Collins is outright embarrassing, to the point where I 
would have thought they would have screen-tested her right out of the picture early on. 
When she goes incognito as a man, it's hard not to laugh at all the things we're not supposed to laugh at. 
With Joseph Fiennes standing there trying to look sincere and complicated, it's hard not to make 
devastating comparisons to Gwyneth Paltrow's performance in "Shakespeare in Love." 
The big problem however that over-rides everything in this film is just a lack of emotional focus. 
It's really hard to tell whether this film is trying to be a somewhat serious comedy or a strangely silly drama. 
Surely a good summer stock performance would wring more laughs from the material than this somber production. 
The actors seem embarrassed to be attempting humor, and unsure of where to place dramatic and comedic emphasis. 
All of this is basically the fault of the director, Michael Radford, who seems to think that the material 
is a great deal heavier than it appears to me."""
bad_review_tokens = analyzer(bad_review_text)
lda_model_n.get_document_topics(dictionary.doc2bow(bad_review_tokens))
list(set(bad_review_tokens) & set(tokens))
Out[97]:
['adaptation']
In [98]:
display(HTML(explore_opinions(bad_review_text, shared_tokens).to_html().replace("\\n","<br>").replace('adaptation', '<strong>adaptation</strong>')))
Token Sentence Sentiment Polarity Sentiment Subjectivity
0 adaptation And what I really can't say is whether this is a poor adaptation,
or whether the play is just a bad choice for film.
-0.3 0.488889

Analizando Reviews Nuevas fuera del Corpus

In [186]:
bb_text = """Drug wars, meth, the lot. I thought no thank you. 
I kept hearing how good it was and I kept saying: "No thank you" 
Last January I got sick, one of those illnesses you can't quite figure out. 
Maybe it was pre and post election depression, I don't know. But I stayed in bed for almost 
10 days and then it happened. I saw the first episode and I was immediately and I mean immediately, 
hooked. I saw the entire series in 9 days. Voraciously. Now I had time to reflect. Why I wonder. 
When I think about it the first thing that comes to mind is not a thing it's Bryan Cranston. 
I know the concept was superb as was the writing but Bryan Cranston made it all real. 
His performance, the creation of Walter White will be studied in the Acting classes of the future. 
He is the one that pulls you forward - as well as backwards and sideways - then I realized that his 
creation acquired the power that it acquired, in great part thanks to the extraordinary cast of supporting players. 
I could write a page for each one of them but I'm just going to mention Aaron Paul. 
I ended up loving him. I developed a visceral need to see him find a way out. Well, what can I tell you. 
I know that one day, maybe when my kids are old enough, I shall see "Breaking Bad" again. I can't wait."""
bb_review_tokens = analyzer(bb_text)
lda_model_n.get_document_topics(dictionary.doc2bow(bb_review_tokens))
Out[186]:
[(0, 0.18458492),
 (3, 0.26951346),
 (6, 0.13031451),
 (9, 0.1640074),
 (10, 0.041175943),
 (12, 0.18852876)]
In [102]:
lda_model_n.show_topic(3)
Out[102]:
[('documentary', 0.011184368),
 ('understand', 0.010630106),
 ('person', 0.0099808462),
 ('lives', 0.0098009398),
 ('society', 0.0089843161),
 ('message', 0.0077291606),
 ('self', 0.0074037556),
 ('reality', 0.0072154128),
 ('experience', 0.0066043506),
 ('change', 0.0059017823)]
In [127]:
bb_text_2 = """What do you get when you have a chemistry teacher in a mid life crisis, dying of cancer, 
and washing cars as a second job to make ends meet for his middle class family? One of the greatest television 
dramas of all time with crazy plot twists, brilliant performances, and unforgettable characters and cinematography.
There is so much to like about the masterpiece that is Breaking Bad. Take your pick: the acting, 
the writing, the story lines, the plot, the suspense the cliff hangers, the action scenes, the camera work, 
the characters, the character arcs, the realism, the satirical style, any season, the end, the casting, the 
dark humor and humor relief, the scenery, the contrast between background and foreground to establish artistic 
effect (the sun shiny clear blue skies of the NM desert behind the gruesome organized crime and violence of the 
underworld), the mixing of favorite genres (crime caper, dark comedy, western, noir, horror, suspense, action, 
drama, thriller, Shakespearean tragedy, dystopia, psychological character study..), the lines/quotes...
the list goes on.
What's amazing about Breaking Bad is it begins so humble and quiet, and as it continues to let its' story unfold,
it explodes. It gets better and better each season until the end in the final season, we don't know if we're watching a
television show or an Academy Award winning motion picture. The show dares to go where no one would have thought 
it would go- into a transcendent realm of classic cinema- and it pulls it off beautifully."""
bb_review_tokens = analyzer(bb_text_2)
lda_model_n.get_document_topics(lda_model_n.id2word.doc2bow(bb_review_tokens))
Out[127]:
[(1, 0.16781522),
 (3, 0.038449161),
 (4, 0.091706149),
 (9, 0.20437928),
 (10, 0.26626715),
 (11, 0.068827115),
 (13, 0.086461544),
 (15, 0.039552584),
 (18, 0.026541797)]
In [114]:
print(list(set(bb_review_tokens) & set(get_topic_tokens(lda_model_n, 10))))
lda_model_n.show_topic(10)
['thriller', 'drama', 'cinema', 'picture', 'award']
Out[114]:
[('oscar', 0.017812377),
 ('actress', 0.014293833),
 ('song', 0.013259972),
 ('direction', 0.012200573),
 ('cinema', 0.011432062),
 ('romantic', 0.010717626),
 ('thriller', 0.010223594),
 ('picture', 0.0091878977),
 ('award', 0.0087983562),
 ('drama', 0.0087924525)]
In [115]:
print(list(set(bb_review_tokens) & set(get_topic_tokens(lda_model_n, 9))))
lda_model_n.show_topic(9)
['season', 'television']
Out[115]:
[('episode', 0.059329994),
 ('season', 0.02849194),
 ('space', 0.019145694),
 ('footage', 0.013079611),
 ('television', 0.012696351),
 ('jason', 0.011333764),
 ('mike', 0.010189135),
 ('pilot', 0.0091847815),
 ('trek', 0.0091349082),
 ('team', 0.0081657609)]
In [116]:
print(list(set(bb_review_tokens) & set(get_topic_tokens(lda_model_n, 1))))
lda_model_n.show_topic(1)
['noir', 'crime']
Out[116]:
[('police', 0.01869829),
 ('murder', 0.017076045),
 ('town', 0.016507478),
 ('crime', 0.013640339),
 ('noir', 0.010154511),
 ('prison', 0.0098637538),
 ('scott', 0.0093448637),
 ('mr', 0.0082623195),
 ('cop', 0.0079970136),
 ('finds', 0.0074484563)]